arXiv:1501.07738v1 [cs.CV] 30 Jan 2015


CO-REGULARIZED DEEP REPRESENTATIONS FOR VIDEO SUMMARIZATION 


Olivier Morere*, Hanlin Goh*, Antoine Veillard*,
Vijay Chandrasekhar, Jie Lin

I2R, UPMC, IPAL


ABSTRACT 

Compact keyframe-based video summaries are a popular
way of generating viewership on video sharing platforms.
Yet, creating relevant and compelling summaries for arbitrarily
long videos with a small number of keyframes is a
challenging task. We propose a comprehensive keyframe-based
summarization framework combining deep convolutional
neural networks and restricted Boltzmann machines.
An original co-regularization scheme is used to discover
meaningful subject-scene associations. The resulting multimodal
representations are then used to select highly relevant
keyframes. A comprehensive user study is conducted, comparing
our proposed method to a variety of schemes, including
the summarization currently in use by one of the most
popular video sharing websites. The results show that our
method consistently outperforms the baseline schemes for
any given number of keyframes, in terms of both attractiveness
and informativeness. The lead is even more significant
for smaller summaries.

Index Terms: Video summarization, deep convolutional
neural networks, co-regularized restricted Boltzmann
machines


1. INTRODUCTION 

Video sharing websites measure user engagement through 
click rates and viewership. To make a novel video attractive 
for the audience, its video link is often presented as a thumbnail
of either a single representative frame or a slideshow of
several keyframes. In this work, we explore the problem of 
automatically generating diverse, representative and attractive 
keyframe-based summaries for videos. 

Video summarization techniques can be broadly divided
into three categories: 1) keyframe-based, 2) skimming-based
and 3) story-based. In keyframe-based summarization, the
video is summarized using a small number of keyframes selected
based on some criterion, such as low-level features,
like pixel data, motion features, optical flow and frame differences,
or higher-level information, like objects and
faces. For this class of algorithms, clustering techniques
like k-means are popular: clustering or grouping is performed
based on raw RGB pixels, or a combination of low- and
high-level features [6][7][8][9][10]. The frames closest to the
cluster centers are chosen to be part of the summary.

Fig. 1. Deep co-regularized keyframe summary. Our method
extracts diverse, representative and attractive keyframes.

Skimming-based summarization is used to produce longer
video summaries. The video is divided into smaller shots using
shot boundary detection algorithms and a series of shots
is selected to form the summary video. Subshot selection
is based on motion activity [11][12][13] and other high-level
features, such as person and landmark descriptors.

Finally, in storyboard-based summarization, algorithms
take into account relationships between the different subshots.
This enables long egocentric videos to be summarized
to gain an understanding of the underlying events.

Contributions. This work focuses on generating compact
keyframe-based summaries. The main contributions
are as follows:

• A comprehensive keyframe-based summarization framework
combining deep convolutional neural networks
(DCNNs) and restricted Boltzmann machines (RBMs).

• A co-regularization scheme for RBMs able
to learn joint high-level subject-scene representations.

• A comprehensive user study comparing our method
against various schemes, including the algorithm in use
by the video sharing website Dailymotion.

2. CO-REGULARIZED DEEP REPRESENTATIONS

A good keyframe-based summary should consist of easily
recognizable subjects in context-setting scenes. To achieve
this, we generate frame-level descriptions by exploiting deep
convolutional architectures to recognize subjects and scenes.
Compact representations are then computed with a novel
co-regularization unsupervised learning scheme to exhibit
the high-level associations between subjects and scenes.
Keyframes are subsequently generated from these compact
representations.

2.1. Deep Convolutional Neural Networks 

Deep convolutional neural networks (DCNNs) have recently
been used to obtain remarkable performance in both image
classification and image retrieval tasks.
For every frame sampled from the video at regular intervals,
DCNN descriptors are extracted using the open source
Caffe framework along with two pre-trained networks:
VGG-ILSVRC-2014-D [20] and Places-CNN.

VGG-ILSVRC-2014-D is the best performing single network
from the VGG team during the ILSVRC 2014 image
classification and localization challenge using the ImageNet
dataset. This 138-million-parameter network is made
of 16 layers: 13 convolutional layers followed by 3
fully-connected layers. It detects 1000 mostly subject-centric
categories (e.g. animals, objects, plants).
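
As a sanity check on the figures quoted above, the parameter count can be recomputed from the layer shapes. The sketch below assumes the published VGG configuration D (3 x 3 kernels, a 224 x 224 x 3 input and five 2 x 2 poolings, so that the first fully-connected layer sees a 7 x 7 x 512 volume). These layer widths are not stated in this paper, and all names are illustrative.

```python
# Layer shapes assumed from the published VGG configuration D (not given in this paper).
# Each conv layer uses 3x3 kernels; entries are (input channels, output channels).
CONV_LAYERS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# Fully-connected layers as (inputs, outputs); 7 * 7 * 512 assumes a 224x224 input.
FC_LAYERS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def count_parameters(conv_layers, fc_layers, kernel_size=3):
    """Return (conv, fc) parameter counts, weights plus biases."""
    conv = sum(kernel_size * kernel_size * c_in * c_out + c_out
               for c_in, c_out in conv_layers)
    fc = sum(n_in * n_out + n_out for n_in, n_out in fc_layers)
    return conv, fc


conv_params, fc_params = count_parameters(CONV_LAYERS, FC_LAYERS)
total_params = conv_params + fc_params
print(len(CONV_LAYERS), len(FC_LAYERS), total_params)  # 13 3 138357544
```

Under these assumptions the total is about 138 million, of which roughly 124 million sit in the three fully-connected layers.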

Places-CNN is a 60-million-parameter network following
the AlexNet structure, with a total of 8 layers: 5 convolutional
layers followed by 3 fully-connected layers. It is trained on
the Places 205 dataset, a scene-centric image dataset featuring
205 categories including indoor and outdoor scenes.

For both DCNNs, descriptors are extracted from the last 
layer before the softmax operation, having a dimensionality 
of 1000 and 205 for VGG-ILSVRC-2014-D and Places-CNN, 
respectively. 

2.2. Co-Regularized Restricted Boltzmann Machines 

To create the video summaries from the DCNN descriptions
of subjects $x_o$ and scenes $x_p$, we introduce a pair of concurrently
trained restricted Boltzmann machines (RBMs) to learn
their projections ($z_o$ and $z_p$) to $K$ units each, where $K$ is the
desired number of keyframes. An RBM is a bipartite network
with a projection matrix $W$ that maps between its input and
output units. RBMs are trained through gradient descent on
the approximate maximum likelihood objective, based on network
states drawn from Gibbs sampling.
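
The training procedure is only outlined here. As a rough illustration, the sketch below performs one contrastive-divergence (CD-1) update, a common way of approximating the maximum likelihood gradient with a single Gibbs step. It assumes Bernoulli units and inputs scaled to [0, 1]. The function name, the learning rate and the choice of CD-1 are assumptions, not details taken from this paper.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cd1_update(W, b_vis, b_hid, x, rng, lr=0.01):
    """One CD-1 step for a Bernoulli RBM (hypothetical helper).

    W: (n_visible, n_hidden) projection matrix.
    x: (batch, n_visible) inputs in [0, 1].
    Returns updated copies of W, b_vis and b_hid.
    """
    # Positive phase: hidden probabilities given the data, then a binary sample.
    h_data = sigmoid(x @ W + b_hid)
    h_sample = (rng.random(h_data.shape) < h_data).astype(x.dtype)
    # Negative phase: one Gibbs step down to the visible units and up again.
    x_model = sigmoid(h_sample @ W.T + b_vis)
    h_model = sigmoid(x_model @ W + b_hid)
    # Approximate likelihood gradient: data statistics minus model statistics.
    n = x.shape[0]
    W_new = W + lr * (x.T @ h_data - x_model.T @ h_model) / n
    b_vis_new = b_vis + lr * (x - x_model).mean(axis=0)
    b_hid_new = b_hid + lr * (h_data - h_model).mean(axis=0)
    return W_new, b_vis_new, b_hid_new


rng = np.random.default_rng(0)
x_o = rng.random((32, 1000))               # stand-in for a minibatch of subject descriptors
W = 0.01 * rng.standard_normal((1000, 8))  # K = 8 output units
W, b_vis, b_hid = cd1_update(W, np.zeros(1000), np.zeros(8), x_o, rng)
```

The same update would be applied to the scene RBM with 205-dimensional inputs. Real DCNN descriptors would first need to be mapped to [0, 1] for the Bernoulli assumption to hold.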

Fig. 2. A pair of co-regularized RBMs, one representing
subjects and another representing scenes, are learned concurrently.
(a) Training co-regularized RBMs: during training, a subject
unit is regularized by its corresponding scene unit and vice versa.
(b) Generating a frame descriptor from co-regularized RBMs: the
frame descriptor is a linear combination of the two co-regularized
RBM descriptors, forming relevant subject-scene associations.

In this work, we introduce co-regularization for RBMs.
The object RBM is regularized by place representations and in
turn regularizes the training of the place RBM (Figure 2(a)).
Given randomly sampled minibatches of subject and scene
DCNN descriptors $\{X_o, X_p\}_i$, we introduce co-regularization
cross entropy penalties to the RBM objective functions:

$$\arg\min_{W_o} \; -\sum_i \Big( \log P(X_{o,i}, Z_{o,i}) + \lambda_o \sum_k \log P(\hat{z}_{p,k} \mid \hat{z}_{o,k})_i \Big), \tag{1}$$

$$\arg\min_{W_p} \; -\sum_i \Big( \log P(X_{p,i}, Z_{p,i}) + \lambda_p \sum_k \log P(\hat{z}_{o,k} \mid \hat{z}_{p,k})_i \Big), \tag{2}$$

where $\{Z_o, Z_p\}_i$ are the RBM projections of $\{X_o, X_p\}_i$,
$\{\lambda_o, \lambda_p\}$ are the regularization constants, and
$\{\hat{z}_{o,k}, \hat{z}_{p,k}\}_i$ refer to unit $k$ in the
distribution-sparsified representations of the minibatch.
Sparsity across units helps avoid co-adaptation
between the units and improves representational
diversity across instances of frames. The co-regularization
terms serve the purpose of binding a subject and the scene in
which it occurs to the same unit position.
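
To make the role of the cross entropy penalties more concrete, the sketch below computes such a penalty between the unit responses of the two RBMs and applies its gradient to both projection matrices. It is a simplification under stated assumptions: units are sigmoid, the raw responses of the other RBM are used as targets in place of the distribution-sparsified representations, and the likelihood term of the objectives is left out. All function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cross_entropy_penalty(target, z, eps=1e-12):
    """Cross entropy -sum log P(target | z), summed over frames and units."""
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(target * np.log(z) + (1.0 - target) * np.log(1.0 - z))


def penalty_gradient(X, W, b, target):
    """Gradient of the penalty w.r.t. W and b for z = sigmoid(X @ W + b)."""
    delta = sigmoid(X @ W + b) - target  # derivative w.r.t. the pre-activations
    return X.T @ delta, delta.sum(axis=0)


def coregularization_step(X_o, X_p, params_o, params_p,
                          lam_o=0.1, lam_p=0.1, lr=0.01):
    """Apply only the penalty part of the two objectives (likelihood term omitted)."""
    (W_o, b_o), (W_p, b_p) = params_o, params_p
    z_o = sigmoid(X_o @ W_o + b_o)
    z_p = sigmoid(X_p @ W_p + b_p)
    n = X_o.shape[0]
    # The subject RBM is pulled towards the scene responses, and vice versa.
    gW_o, gb_o = penalty_gradient(X_o, W_o, b_o, target=z_p)
    gW_p, gb_p = penalty_gradient(X_p, W_p, b_p, target=z_o)
    new_o = (W_o - lr * lam_o * gW_o / n, b_o - lr * lam_o * gb_o / n)
    new_p = (W_p - lr * lam_p * gW_p / n, b_p - lr * lam_p * gb_p / n)
    return new_o, new_p
```

In a full training loop, this step would be combined with the likelihood gradient of each RBM, with the two regularization constants controlling how strongly unit $k$ of one RBM follows unit $k$ of the other.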

The frame descriptor is a linear combination of the two
RBM descriptors (Figure 2(b)). The final set of keyframe timings
$t_k$, $k \in [1..K]$, is the ordered set of $K$ timings that gives
the maximum response for each unit of the frame descriptor:

$$t_k = \arg\max_t \; \alpha z_{o,k}^{t} + (1 - \alpha) z_{p,k}^{t}, \tag{3}$$

where $\alpha \in [0, 1]$ is a balance hyperparameter that causes the
summary to be more subject-centric or scene-centric.
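
The selection rule of Equation (3) can be sketched in a few lines. The example assumes the unit responses of both RBMs are available as arrays with one row per sampled frame and one column per unit; the function name and array layout are illustrative choices.

```python
import numpy as np


def select_keyframes(z_o, z_p, timestamps, alpha=0.5):
    """Pick one timing per unit by maximizing the combined response.

    z_o, z_p: (T, K) subject and scene unit responses for T sampled frames.
    timestamps: (T,) time of each sampled frame, e.g. in seconds.
    Returns the K selected timings in increasing order.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    descriptor = alpha * z_o + (1.0 - alpha) * z_p  # frame descriptor, (T, K)
    best_frame = descriptor.argmax(axis=0)          # strongest frame for each unit
    return np.sort(np.asarray(timestamps)[best_frame])


z_o = np.array([[0.9, 0.1], [0.2, 0.3], [0.1, 0.8]])
z_p = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
t = np.array([0.0, 1.0, 2.0])
print(select_keyframes(z_o, z_p, t, alpha=1.0))  # [0. 2.]
print(select_keyframes(z_o, z_p, t, alpha=0.0))  # [1. 2.]
```

Nothing in this sketch prevents two units from selecting the same frame; how such ties are handled is not specified here.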

This proposed co-regularization method is not specific to 
subjects or scenes, and is generalizable to other concepts or 
modalities, such as faces or activities. 





































3. VIDEO SUMMARIZATION 


Using our method, we summarized all 11 episodes from the
BBC educational TV series Planet Earth¹. Each episode is
approximately 50 minutes long. A sample of our results is
shown in Figure 5(a).

¹ http://www.bbc.co.uk/programmes/b006mywy

3.1. Model Visualization

3.1.1. Balancing Subject- and Scene-Centricity

As shown in Figure 3, the bias towards subjects or scenes can be
adjusted by tuning the $\alpha$ parameter from Equation (3). This
flexibility allows for interesting functionalities such as customising
content based on user profiling or explicit queries.
The choice of $\alpha$ can also be made independently for
each unit in order to generate the most visually attractive
keyframe, for example based on vibrancy. In practice, setting
the default value to $\alpha = 0.5$ (as used in this empirical study)
seems to produce satisfactory results.

Fig. 3. Actual keyframes selected by varying $\alpha$. Our model
can be tuned to select keyframes that are more subject-centric
(left), scene-centric (right) or a balance of both (middle).

3.1.2. Visualisation of Co-Regularized RBM Units

Although neural networks tend to be thought of as black
boxes, visualization is often useful to decipher what has
been learned [26]. To better understand our co-regularized
model, we analysed the responses of each unit across the
dataset. For this analysis, we trained a single $K = 12$ model
across all 11 episodes. For each of the 24 RBM units, the
top 100 frames that most strongly activate the unit were
aggregated via a weighted average. The resulting graphical
representation of each unit is shown in Figure 4. We observe
that the visual appearances of frames corresponding to
a subject-scene pair of units are consistently similar. There is
also diversity across the units within an RBM.

The top 2 categories of each unit identified from the
weight matrices are also shown in Figure 4. We notice that
the correlation with the visual representation is strong and
the subject-scene association is sensibly learned. We can
also observe an interesting effect of co-regularization, where
associations can be made between subjects (e.g. polar bear
and king penguin) that occur in the same scene (iceberg) but
never within the same frame.

Subject unit (top 2 categories)    Scene unit (top 2 categories)
polar bear, king penguin           iceberg, snowfield
tiger shark, sting ray             underwater, coral reef
impala, gazelle                    field, desert
camel, hartebeest                  desert, canyon
grey whale, sea lion               ocean, coast
black stork, spoonbill             swamp, marsh
jay, bulbul                        sky, rainforest
platypus, crocodile                creek, river
snow leopard, ibex                 snowfield, snowy mtn.
capuchin, chimpanzee               botanical gdn., rainforest
lynx, leopard                      rainforest, forest path
goose, albatross                   marsh, watering hole

Fig. 4. Visualization of the units for a $K = 12$ model. The visual
representations of subject-scene pairs are well correlated.
The categories of the two models are associated in a sensible
way and correspond well with the visual representations.

3.2. User Engagement Study

3.2.1. Evaluation Framework


Our method is compared against three other keyframe-based
summarization schemes: naive uniform sampling, k-means
clustering and the method currently in use by the video sharing
website Dailymotion². Each summary is presented as a
timeline of keyframes, as shown in Figure 5.

Uniform sampling takes $K$ keyframes with evenly spaced
timestamps $t_i$, $i \in [1..K]$, over the total duration $d$ of the
video. The k-means clustering scheme uses
frames sampled at the same frequency as for our method (1
fps) and down-sized to 32 x 32 RGB pixels. Lloyd's algorithm
[27] is used to separate the data into $K$ clusters. 100
runs with different centroid seeds are performed to mitigate
the effects of local minima. For each cluster, the frame closest
to its centroid is selected as a keyframe. Dailymotion proposes
an 8-keyframe video summary (excluding the title frame),
which was used as a black-box scheme to compare our method
against. The evaluation videos were uploaded to the website
and the proposed summary keyframes were then hand-picked
from the original footage.
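
The k-means baseline described above can be sketched as follows. This is a minimal illustration, assuming the frames have already been sampled and down-sized; initializing centroids from randomly chosen frames, the stopping rule and all function names are assumptions rather than details given in the text.

```python
import numpy as np


def squared_distances(X, C):
    """Pairwise squared Euclidean distances between rows of X (n, d) and C (k, d)."""
    d2 = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(axis=1)[None, :]
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding


def lloyd(X, k, rng, max_iter=100):
    """One run of Lloyd's algorithm, seeded with k randomly chosen frames."""
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        labels = squared_distances(X, C).argmin(axis=1)
        new_C = C.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_C[j] = members.mean(axis=0)
        if np.allclose(new_C, C):
            break
        C = new_C
    d2 = squared_distances(X, C)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return C, labels, inertia


def kmeans_keyframes(frames, k, n_runs=100, seed=0):
    """Indices of the frames closest to the centroids of the best of n_runs runs."""
    X = np.asarray(frames, dtype=np.float64).reshape(len(frames), -1)
    rng = np.random.default_rng(seed)
    # Keep the run with the lowest within-cluster sum of squares.
    C, labels, _ = min((lloyd(X, k, rng) for _ in range(n_runs)),
                       key=lambda run: run[2])
    d2 = squared_distances(X, C)
    keyframes = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:
            keyframes.append(int(members[d2[members, j].argmin()]))
    return sorted(keyframes)
```

With frames stored as an array of shape (T, 32, 32, 3) sampled at 1 fps, the returned indices are also the keyframe timestamps in seconds.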

² http://www.dailymotion.com/

Fig. 5. Eight-keyframe summaries for episode 1 of the TV series Planet Earth:
(a) our method, (b) uniform sampling, (c) k-means clustering, (d) Dailymotion.


The study was performed by showing pairs of summaries
(our method against one of the three baseline schemes)
to eight different testers who had not previously seen the
videos. For each pair, they were asked to answer the following
two questions:

• Q1: Which video would you rather watch? (attractiveness)

• Q2: Which summary was more informative? (informativeness)

Using all 11 Planet Earth episodes, summaries were
generated for different numbers of keyframes $K = 4, 6, 8$,
except for Dailymotion, which imposes $K = 8$ by default. In
total, $8 \times 11 \times 2 \times 3 + 8 \times 11 = 616$ answers were collected
for each question.

Uniform sampling appears as a natural choice for the
wildlife documentaries used in this study, given the
slow pace of the action and the high visual appeal of the average
frame. K-means is expected to capture the diversity
of the scenes well, whereas it may not perform as well with
respect to subjects.

3.2.2. Results and Discussion 

Table 1 aggregates the answers from the testers. Overall,
our method was systematically found more attractive (75%
to 97.73% of the time) and more informative (76.14% to
94.32%). Perceived attractiveness and informativeness are
strongly correlated. Against Dailymotion's algorithm, our
method is preferred more than three times out of four,
a marked improvement over the scheme currently
used by the service.


Table 1. How often our method is preferred over each of the
three schemes (percentage) for different K.

      |       uniform       |       k-means       | daily.
  K   |   4      6      8   |   4      6      8   |   8
  Q1  | 79.55  82.95  76.14 | 97.73  82.95  75.00 | 77.27
  Q2  | 78.41  80.68  81.82 | 94.32  80.68  76.14 | 78.41
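
The percentages in Table 1 can be read back as answer counts. The short check below assumes that each cell aggregates one answer per tester and episode, i.e. 8 x 11 = 88 answers. This denominator is inferred from the study description rather than stated explicitly, and the names are illustrative.

```python
ANSWERS_PER_CELL = 8 * 11  # 8 testers x 11 episodes (assumed per-cell denominator)

# Columns: uniform K=4,6,8; k-means K=4,6,8; Dailymotion K=8.
TABLE_1 = {
    "Q1": [79.55, 82.95, 76.14, 97.73, 82.95, 75.00, 77.27],
    "Q2": [78.41, 80.68, 81.82, 94.32, 80.68, 76.14, 78.41],
}


def implied_counts(percentages, n=ANSWERS_PER_CELL):
    """Number of answers in favour of our method implied by each percentage."""
    return [round(p * n / 100.0) for p in percentages]


for question, row in TABLE_1.items():
    print(question, implied_counts(row))
# Q1 [70, 73, 67, 86, 73, 66, 68]
# Q2 [69, 71, 72, 83, 71, 67, 69]
```

Every reported percentage matches a whole number of answers out of 88 to two decimal places, and the 7 cells per question account for the 7 x 88 = 616 answers collected.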


For varying numbers $K$ of keyframes, the improvement
is rather consistent against uniform sampling whereas, against
k-means, the improvement is more pronounced when $K$ is
smaller. This indicates that our method, which performs well
overall, is particularly well-suited to compact summaries.

4. CONCLUSIONS 

Building upon recent advances in deep learning and image
recognition, we proposed a comprehensive keyframe-based
summarization framework combining DCNNs and RBMs.
Through a comprehensive empirical study, we showed that
our method is able to outperform a number of existing
schemes. In addition, our novel co-regularization scheme,
which discovered meaningful subject-scene associations, is
generalizable to other concepts and modalities.

Beyond the selection of quality keyframes, our contribution
represents a strong step towards the Holy Grail of text-based
video summaries by introducing highly interpretable
semantic representations.